Pandas Basics

Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

The fundamental Pandas data structures:

Pandas Series

Key differences: a numpy array vs. a pandas series

A pandas series with string indexes is like a Python dictionary.
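A minimal sketch of this dictionary-like behavior (the population figures below are made up for illustration):

```python
import pandas as pd

# Hypothetical data: state populations keyed by state code
population = pd.Series({'CA': 39_500_000, 'TX': 29_100_000, 'NY': 19_500_000})

# Dictionary-style lookup by label
ca = population['CA']

# Membership test works on the index, just like dict keys
has_ca = 'CA' in population
```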

Pandas DataFrame

The following figure defines different components of a dataframe.

Data Indexing and Selection

Key concept: explicit vs. implicit index

Integer-location based indexing

iloc[] integer-location based indexing for selection by position. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

The syntax is iloc[row, column], with positions starting from 0.
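For example, with a small made-up dataframe, iloc selects purely by position:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60]},
                  index=['x', 'y', 'z'])

# iloc[row, column]: 0-based integer positions, ignoring the labels
first_cell = df.iloc[0, 0]

# Slices with iloc EXCLUDE the end position (like Python lists)
first_two_rows = df.iloc[0:2, :]
```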

label based indexing

loc[]: Access a group of rows and columns by label(s) or a boolean array. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

The syntax is loc[row, column], using labels rather than positions.
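The same made-up dataframe with loc; note that label slices include both endpoints, unlike iloc:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60]},
                  index=['x', 'y', 'z'])

# loc[row, column]: selection by label
cell = df.loc['y', 'b']

# Slices with loc INCLUDE both endpoints (unlike iloc)
sub = df.loc['x':'y', ['a']]
```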

Load and explore data from csv

Exercises

Here are some exercises for you:

Data Manipulation in Pandas


Introduction

When you get a dataset to analyze, it is rare that the data set is clean or in exactly the right form you need. Often you’ll need to perform some data preprocessing/wrangling, e.g., creating some new variables or summaries, filtering out some rows based on certain search criteria, renaming the variables, reordering the observations by some column, etc.

In this notebook, you will learn how to perform a variety of data preprocessing tasks. Here, we will use a dataset on flights departing New York City in 2013.

Data frame with columns

Basic Operations of Data Manipulations

You will learn the five key operations that allow you to solve the vast majority of your data manipulation challenges:

These can all be used in conjunction with groupby(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.

Select Rows

Logical operators

As shown above, multiple filtering conditions are combined with “&”: every condition must be true in order for a row to be included in the output.

For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ~ is “not”.
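A quick sketch with made-up rows standing in for the flights data; each condition needs its own parentheses:

```python
import pandas as pd

# Hypothetical rows standing in for the flights data
df = pd.DataFrame({'month': [1, 1, 2, 12], 'dep_delay': [5, -3, 120, 0]})

jan_delayed = df[(df['month'] == 1) & (df['dep_delay'] > 0)]  # and
winter = df[(df['month'] == 12) | (df['month'] == 1)]         # or
not_feb = df[~(df['month'] == 2)]                             # not
```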

Missing values

It is quite common to have missing values, or NaNs, in data frames. NaN represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

In Python, if you want to determine if a value is missing, use .isnull():
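A small sketch showing both the "contagious" behavior and how .isnull() detects missing values:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

mask = s.isnull()       # True where a value is missing
n_missing = mask.sum()

# NaN is "contagious" in element-wise operations...
plus_one = s + 1        # the middle value stays NaN

# ...but most pandas aggregations skip NaN by default
total = s.sum()
```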

Sorting

Given a data frame, we often want to sort the rows by a column name, or a set of column names, or more complicated expressions. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
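A minimal example of tie-breaking with sort_values (made-up values):

```python
import pandas as pd

df = pd.DataFrame({'year': [2013, 2013, 2012], 'delay': [5, 1, 9]})

# Sort by year; delay breaks ties within the same year
by_year_delay = df.sort_values(['year', 'delay'])

# The sort direction can be set per column
mixed = df.sort_values(['year', 'delay'], ascending=[True, False])
```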

Select Columns

When you work with a dataset with hundreds or even thousands of variables, which is not uncommon, the first challenge is often narrowing in on the variables you’re actually interested in.

Select columns whose name matches regular expression regex.

df.filter(regex='regex')
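For instance, with column names borrowed from the flights data (values made up):

```python
import pandas as pd

df = pd.DataFrame({'dep_time': [517], 'dep_delay': [2],
                   'arr_time': [830], 'carrier': ['UA']})

dep_cols = df.filter(regex='^dep')    # columns starting with 'dep'
time_cols = df.filter(regex='time$')  # columns ending with 'time'
```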

Add new variables

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns.

Useful creation functions

There are many functions for creating new variables

Data Manipulation in Pandas

In this assignment, you will be working on the same dataframe of flights departing New York City in 2013.

Data frame with columns

Question 1. Selecting rows

From the 'flights' dataframe, find all flights that satisfy the following conditions:

Question 2. Sorting

Question 3. Selecting Columns

Use at least three ways to select dep_time, dep_delay, arr_time, and arr_delay from flights.

Question 4. Adding new columns

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers.

For example, 759 means 7:59 and 801 means 8:01. Their difference is not 42 but 2 minutes.
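One way to make such values computable, sketched here, is to convert HMM/HHMM times to minutes since midnight:

```python
# HMM/HHMM encoded times: 759 means 7:59, 801 means 8:01
def to_minutes(t):
    hours, minutes = divmod(t, 100)  # 759 -> (7, 59)
    return hours * 60 + minutes

diff = to_minutes(801) - to_minutes(759)  # 2 minutes, not 42
```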

Question 5. Mixing things together

The following questions may require multiple operations above.

Data Summarization

Basic Operations of Data Manipulations

You have learned several key operations that allow you to solve the vast majority of your data manipulation challenges:

These can all be used in conjunction with groupby() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.

Data frame with columns

Summary functions

Pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy objects) and produce a single value for each group or column. When applied to a DataFrame, the result is returned as a pandas Series containing one value per column. Examples:

These summary functions can be applied to all the rows in the dataframe.

Group by

These summary functions are not terribly useful unless we pair them with groupby(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use a summary function on a grouped data frame, it is automatically applied “by group”.
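A minimal sketch with made-up rows standing in for the flights data:

```python
import pandas as pd

df = pd.DataFrame({'carrier': ['UA', 'UA', 'AA'],
                   'dep_delay': [4, 6, 10]})

# The mean is computed per carrier, not over all rows
mean_delay = df.groupby('carrier')['dep_delay'].mean()
```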

What if we want to apply different summary functions to different columns?
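One option, sketched here with made-up values, is to pass agg() a dict mapping each column to its summary function(s):

```python
import pandas as pd

df = pd.DataFrame({'dest': ['IAH', 'IAH', 'MIA'],
                   'distance': [1400, 1416, 1089],
                   'arr_delay': [11, 20, 33]})

# Different summaries for different columns
summary = df.groupby('dest').agg({'distance': ['mean'],
                                  'arr_delay': ['mean', 'max']})
```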

Combining multiple operations

Now let's put multiple operators we've learned together. Imagine that we want to explore the relationship between the distance and average delay for each destination.

There are four steps to prepare this data:

Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases.

When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups.

Operations within groups

Grouping is most useful in conjunction with aggregate functions. But you can also do other operations within groups:

Exercise: Data Summarization

Data frame with columns

Exercises

Write python script to answer the following questions.

Other operations within groups

Grouping is most useful in conjunction with aggregate functions. But you can also do other operations within groups:

Assignment - Analyzing the IMDB Top 1000 Movies

In the next few assignments, you will be working with this data set of IMDB top 1000 movies.

Source: https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

Part 1: Data Manipulation

Part 2: Data Summarization

Matplotlib Basics

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Object-oriented vs. MATLAB-style interface

Matplotlib has a traditional MATLAB-style interface and a more powerful object-oriented (OO) interface. Please refer to https://matplotlib.org/stable/tutorials/introductory/lifecycle.html if you want to learn more about the differences.

The main thing to remember is:

Exercises:

Create a line plot for the Sigmoid function, which has a characteristic "S"-shaped (sigmoid) curve.

https://en.wikipedia.org/wiki/Sigmoid_function

$S(x)=\frac{1}{1+e^{-x}}=\frac{e^{x}}{e^{x}+1}$
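One possible solution sketch, using the OO interface:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 200)
s = 1 / (1 + np.exp(-x))  # S(x) = 1 / (1 + e^(-x))

fig, ax = plt.subplots()
ax.plot(x, s)
ax.set_xlabel('x')
ax.set_ylabel('S(x)')
ax.set_title('Sigmoid function')
```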

Subplot

Subplots are groups of smaller axes that can exist together within a single figure, making side-by-side comparison easy.

Exercises:

Create a plot that overlays four normal distributions:

  1. $\mu$=0, $\sigma$=1
  2. $\mu$=0, $\sigma$=2
  3. $\mu$=0, $\sigma$=3
  4. $\mu$=2, $\sigma$=1

Figure Size, Labels, Legend

Line Style and Color

Marker

https://matplotlib.org/stable/gallery/lines_bars_and_markers/marker_reference.html

Exercise:

Plot the following function of x and y:

Paint it in red and use big diamond markers (D).

Scatter plot

Exercise:

Create a scatter plot of two series, $x$ and $y$, where:

Histograms

Exercises:

Create a figure that overlays four histograms, each showing one of the following probability distributions:

  1. Normal distribution: $\mu$=0, $\sigma$=1
  2. Normal distribution: $\mu$=0, $\sigma$=2
  3. Uniform distribution between -2 and 2
  4. Exponential distribution with rate parameter $\lambda=\frac{1}{\beta}$=2

Hint:

Data Visualization

A picture is worth a thousand words. Data visualization can help us uncover relationships and patterns that are hidden in our data.

First, we will use graphs to answer some car-related question: Do cars with big engines use more fuel than cars with small engines? What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?

This dataframe contains 234 rows and 11 variables:

Creating a Plot

In the mpg dataset, the two main variables of interest are engine size (displ) and fuel efficiency (hwy). They are both continuous variables. We can use a scatterplot to show their relationship.

Scatter plots are often used for correlation analysis between different features. The correlation coefficient is between -1 and 1, representing negative and positive correlations; 0 means there is no linear correlation. A correlation is said to be linear if the ratio of change is constant; otherwise, it is non-linear.

Adding more variables

In the mpg dataset, there are other variables. How do some of these variables affect the relationship between engine size (displ) and fuel efficiency (hwy)?

For instance, we can add a third variable, like class, to a two dimensional scatterplot to indicate a certain property of objects by color, size, or shape of points.

This is doable in matplotlib, but a package named "seaborn" does this much more easily.

Different Chart Types

We can create different types of plots for data visualization.

Can we create a line chart to show the relationship between displ and hwy?

Statistical Transformation

Bar charts

Bar charts seem simple, but they are interesting because they reveal something subtle about plots.

The diamonds dataset contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.

The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut.

On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from?

Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values (i.e., stats) to plot.

Histograms

Similarly, we can use histograms to check the following:

Histograms for continuous variables

Box plot

adapted from: https://en.wikipedia.org/wiki/Box_plot

A boxplot displays the dataset based on a five-number summary:

IQR is used to determine outliers, which are points that are either greater than Q3+1.5IQR or less than Q1-1.5IQR.
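The outlier rule above can be sketched directly in pandas (the series below is made up, with one obvious outlier):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 looks suspicious

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Keep only points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = s[(s > q3 + 1.5 * iqr) | (s < q1 - 1.5 * iqr)]
```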

Plotting with categorical variables

Why are there so few points? Because many points overlap. This does not show the "density" of data points.

Consider using a stripplot() when at least one variable is categorical.

Pie chart

Not so easy with basic functions. Need to perform some aggregation functions first.

Assignment - Analyzing the IMDB Top 1000 Movies

In the next few assignments, you will be working with this data set of IMDB top 1000 movies.

Source: https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

Part 1: Data Manipulation

Part 2: Data Summarization

Part 3. Data Visualization

You need to complete all data processing/manipulation steps above before visualization.

Exploratory Data Analysis

In any project of data analytics, we start with an important task called exploratory data analysis (EDA), i.e., using transformation, summarization, and visualisation to explore our data in a systematic way, often in an iterative cycle.

In EDA, you:

EDA is not a formal process with a strict set of rules. Instead, think of EDA as a state of mind. Initially, feel free to explore different ideas. Some ideas may be dead ends while some may lead to key insights.

Goal: to develop an understanding of your data.

EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large amount of questions. Answering these questions can help you know what insights are contained in your dataset, expose you to new aspects, and increase your chance of making a discovery.

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?

Variation

Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life. The best way to understand that pattern is to visualize the distribution of the variable’s values.

Visualizing Distribution

How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous.

The Seaborn package has a variety of methods for visualizing distributions of data:

https://seaborn.pydata.org/tutorial/distributions.html

Typical values

In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:

Unusual values

Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the y variable from the diamonds dataset.

Covariation

If variation describes the behavior within a variable, covariation describes the behavior between variables.

Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables.

A categorical and a continuous variable

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable.

These histograms are not that useful for comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape, e.g., the "fair" cut in this dataset.

KDE plot

A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations. Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate:

Note the argument common_norm: when set to False, each group's density is normalized independently.

There’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price!

Boxplot

Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:

Compared to histograms, we see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average!

Two continuous variables

If you want to visualise the covariation between two continuous variables, draw a scatterplot.

Two categorical variables

To visualise the covariation between categorical variables, you’ll need to count the number of observations for each combination.

EDA - Titanic

Create an account on kaggle.com and read the overview of the titanic competition at https://www.kaggle.com/c/titanic/overview, then do the following:

Download the training dataset and rename it to titanic-train.csv and load it using pandas

Tidy Data

In this notebook, you will learn a useful way to organize your data, an organisation called tidy data. Getting your data into this format requires some upfront work, but that work pays off in the long term.

Tidy datasets are all alike, but every messy dataset is messy in its own way.

You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables country, year, population, and cases, but each dataset organizes the values in a different way.

These are all representations of the same underlying data, but they are not equally easy to use. The tidy dataset will be much easier to work with.

Given each of the four dataframes, try to write a script to answer the following questions.

What is a tidy dataset?

There are three interrelated rules which make a dataset tidy:

Among the four dataframes above, which one is tidy? df1

Tidy data makes a lot of operations easier.

Reshaping Data

Tidy data is good. Unfortunately, in practice, most data that you will encounter will be untidy in one way or another. Hence, for most real analyses, you’ll need to do some tidying:

  1. Figure out what the variables and observations are.
  2. Resolve one of two common problems:
    • One variable might be spread across multiple columns.
    • One observation might be scattered across multiple rows.

Melt: to gather columns into rows

A common problem is a dataset where some of the column names are not names of variables, but values of a variable.

Take df4a: the column names 1999 and 2000 represent values of the year variable, the values in the 1999 and 2000 columns represent values of the cases variable, and each row represents two observations, not one.
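A sketch of this tidying step, reconstructing the untidy layout described above with illustrative values:

```python
import pandas as pd

# Sketch of df4a: year values live in the column names
df4a = pd.DataFrame({'country': ['Afghanistan', 'Brazil'],
                     '1999': [745, 37737],
                     '2000': [2666, 80488]})

# Gather the year columns into (year, cases) pairs
tidy = df4a.melt(id_vars='country', var_name='year', value_name='cases')
```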

Pivot: to spread rows into columns

In contrast, another problem is that an observation is scattered across multiple rows.

For example, take df2: an observation is a country in a year, but each observation is spread across two rows.
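A sketch of spreading those rows back into columns (df2 reconstructed here with illustrative values):

```python
import pandas as pd

# Sketch of df2: each (country, year) observation spans two rows
df2 = pd.DataFrame({'country': ['Afghanistan', 'Afghanistan',
                                'Brazil', 'Brazil'],
                    'year': [1999, 1999, 1999, 1999],
                    'type': ['cases', 'population'] * 2,
                    'count': [745, 19987071, 37737, 172006362]})

# Spread the 'type' rows back out into columns
tidy = df2.pivot(index=['country', 'year'], columns='type',
                 values='count').reset_index()
```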

Split a column into columns

Missing values

Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

There are two missing values in this dataset:

An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.

Case Study

The dataset "who.csv" contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.

It is obvious that this dataset is not tidy. So let's tidy it up.

The best place to start is almost always to gather together the columns that are not variables. Let’s have a look at what we’ve got:

According to WHO's data dictionary:

  1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
  2. The next 2~3 letters describe the type of TB:
    • rel stands for cases of relapse
    • ep stands for cases of extrapulmonary TB
    • sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
    • sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
  3. The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
  4. The remaining numbers give the age group. The dataset groups cases into seven age groups:
    • 014 = 0 – 14 years old
    • 1524 = 15 – 24 years old
    • 2534 = 25 – 34 years old
    • 3544 = 35 – 44 years old
    • 4554 = 45 – 54 years old
    • 5564 = 55 – 64 years old
    • 65 = 65 or older

We want to separate the column key to multiple columns.

Now, the dataset is tidy!

Relational Data

In data analysis, you often need to combine multiple data sets to answer the questions that you are interested in.

Collectively, multiple related sets (tables) of data are called relational data. In relational (SQL) databases (DBs), each table is called a relation. Two tables (relations) may have a relationship between each other via a PK (primary key) and a FK (foreign key). It is also not uncommon to have more than two tables related to each other.

To work with relational data, we typically need to from three families of operations:

If you have learned relational databases and SQL (Structured Query Language), you should find many of these concepts and operations familiar.

We will use the nycflights13 package to learn about relational data.

The relationships between these tables are shown in the following diagram:

For nycflights13:

Keys

The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.

For example, each plane is uniquely identified by its tailnum. In other cases, multiple variables may be needed. For example, to identify an observation in weather you need five variables: year, month, day, hour, and origin.

There are two types of keys:

Once you've identified the primary keys in your tables, it is good practice to verify that they do indeed uniquely identify each observation.

Sometimes a table doesn’t have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it.

If a table lacks a primary key or a non-composite key, it’s sometimes useful to add one, e.g., its row number. That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.

Standard Joins

Join two data sets/tables by the PK-FK relationship.

Create two datasets, adf and bdf:

Inner Join

The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal:
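A minimal sketch using two small dataframes named adf and bdf as above (contents made up):

```python
import pandas as pd

adf = pd.DataFrame({'x1': ['A', 'B', 'C'], 'x2': [1, 2, 3]})
bdf = pd.DataFrame({'x1': ['A', 'B', 'D'], 'x3': ['T', 'F', 'T']})

# Only keys present in BOTH tables survive ('C' and 'D' are dropped)
inner = pd.merge(adf, bdf, on='x1', how='inner')
```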

Outer Joins

Filter Joins

Set-like Operations

In Pandas, the merge() method can also be used for set-like operations, such as union, intersection, and set-difference. All these operations work with a complete row, comparing the values of every variable.

Take these as examples:

Appending Rows or Columns to Dataframe

When working with multiple dataframes, we often need to combine them by rows or by columns. This is when we need to use the concat() method.

Examples

Let's use merge() on our flights data. For these examples, we'll make it easier to see what's going on by creating narrower dataframes:

Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and flights2 data frames with a left join:

Sometimes, we need to join on multiple columns.

Sometimes, the column names from the two dataframes may not match. Then, you need to explicitly specify the columns from each side.

Exercises

Assignment - Analyzing the IMDB Top 1000 Movies

In the next few assignments, you will be working with this data set of IMDB top 1000 movies.

Source: https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

Part 1: Data Manipulation

Part 2: Data Summarization

Done!

Part 3: Data Visualization

Done!

Part 4: Tidy Data

After running all cells above, you should have three dataframes:

Now, let's take a quick look at the three dataframes.

Follow the instructions below and write your code to answer the questions:

To better understand the advantages of tidy data, you will first use the "un-tidy" dataframes alone to answer the next few questions:

Tidying the data

Next, you will further tidy the two dataframes stars and genres.

Let's start with stars.

Next, let's reshape the dataframe genres, which is a little bit more complicated.

Classification Basics

Classification is one of the most useful and popular functions in data mining and machine learning.

Essentially, classification is aimed at building a prediction model that can assign data points to a set of predefined classes, i.e., giving a class label to each data point.

There are several key concepts related to classification:

Titanic Data

We will use the famous Titanic data to illustrate the process of classification.

A Naive Prediction Model

Simply based on the overall survival rate (38.4%), we could build a naive prediction model:

Let's apply this model to the test set.

Submit this file "titanic_submit_allzero.csv" to Kaggle. Check your score and ranking.

Can we do better than this?

Another Simple Prediction Model

Since female survival rate is 74.2% and male survival rate is 18.9%, we could build another simple prediction model:

Submit this file "titanic_submit_gender.csv" to Kaggle. Check your score and ranking.

Decision Tree for the Titanic Data

A decision tree is a prediction model that uses a tree-like structure of decisions and their possible consequences. It can be used for classification and regression.

The basic idea of a decision tree is to split data set based on the homogeneity of data, i.e., reducing “impurity”.

Entropy

Entropy is one of the most common measures for calculating impurity.

$H(x)=-\sum_{i=1}^np_{i}\log p_{i}$, where $p_{i}$ is the probability of class $i$ in the data.
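A minimal sketch of the formula in code (using log base 2, so entropy is measured in bits):

```python
import math

def entropy(probs):
    # H = -sum(p_i * log2(p_i)); the 0 * log(0) term is taken as 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_half = entropy([0.5, 0.5])  # 50/50 split: maximally impure
h_pure = entropy([1.0])       # pure node: zero impurity
```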

scikit-learn

Scikit-learn is one of the most popular Python packages for predictive data analysis.

https://scikit-learn.org/

We will use sklearn for building decision trees.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

You can see this is a huge tree, which may lead to an overfitting problem.

Let's set max_depth=3 to generate a simpler tree. Also, set criterion='entropy'.
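A sketch of those settings on tiny made-up rows for ['Pclass', 'SibSp', 'Fare'] (NOT the real Titanic data):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical values for ['Pclass', 'SibSp', 'Fare'] and survival labels
X = [[1, 0, 80.0], [3, 1, 7.25], [2, 0, 13.0], [3, 0, 8.05],
     [1, 1, 53.1], [2, 1, 26.0], [3, 3, 21.1], [1, 0, 30.0]]
y = [1, 0, 1, 0, 1, 1, 0, 1]

# Limiting depth keeps the tree small and easier to interpret
tree_clf = DecisionTreeClassifier(max_depth=3, criterion='entropy',
                                  random_state=0)
tree_clf.fit(X, y)
preds = tree_clf.predict(X)
```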

About Each Node

Make Predictions

Remember that the training data has three features ['Pclass', 'SibSp', 'Fare']; we can predict the target based on different values for those three features.

Next, we use this simple DT to predict for the test set.

Data Preprocessing for Modeling

In this notebook, we continue working on the Titanic data to predict survival by including more predictor variables. In particular, our code will:

Handling Missing Values

There are two main types of approaches to handle missing values in data:

Processing the Target Column

Save the target column and drop it from the training set. Since the target column has no missing values now and we normally don't need to encode the target column (even if it has categorical values), let's save it and then drop it from the training set.

Processing Categorical Variables

So far, we have only included numerical variables in the model. Next, we will handle categorical variables.

Let's pause here and consider this:

Is this ordinal encoder appropriate for encoding these two variables? Why or why not?

Is the onehot encoder a better choice for encoding these two categorical variables?

Build the model

Now we can prepare for the final train dataset and build the model:

We first generated the following dataframes:

Then, we split the df into numerical and categorical dataframes:

Now, we can prepare the final training dataset by combining some of them.

Make Predictions

When we train tree_clf1/tree_clf2 above, titanic_train_encoded/titanic_train_onehot_encoded is the input dataframe we used.

When we make predictions, the input dataframe we feed into the model MUST match the structure (the number of columns with specific order) of titanic_train_encoded/titanic_train_onehot_encoded.

Assignment

Next, you need to follow the same process and apply the second decision tree "tree_clf2" to the test set with the two categorical variables 'Sex' and 'Embarked' encoded by the 'onehot_encoder'. Name the output file "tree2_onhot_submit.csv"

Regression Basics

Regression is one of the most useful and popular functions in data mining and statistical learning.

Regression is aimed at building a model that can predict the value of y based on X.

There are several key concepts related to regression:

Making predictions using the model

OLS Regression (optional)

You can do linear regression using statsmodels package as follows, which gives you more information (such as R-squared and p-value) from a statistics perspective.

Checkout more at: https://dss.princeton.edu/online_help/analysis/interpreting_regression.htm

Polynomial Regression

Polynomial Regression can fit non-linear data to a linear model by adding powers of each feature as new features and then train a linear model on the extended set of features.
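A minimal sketch on synthetic, noise-free quadratic data (made up for clarity), so the recovered coefficients match the generating function:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2  # quadratic, noise-free for clarity

# Add x^2 as an extra feature, then fit an ordinary linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)        # columns: [x, x^2]
model = LinearRegression().fit(X_poly, y)
```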

Log Transformation

Regression Models

In our exploratory data analysis (EDA), we’ve seen some surprising relationships between the quality of diamonds and their price: low quality diamonds (poor cuts, bad colours, and inferior clarity) have higher prices.

Do these charts mean lower quality diamonds have higher prices? If that's the case, why do people pay higher prices for lower quality?

Do not forget there is an important confounding variable: the weight (carat) of the diamond. The weight of the diamond is the single most important factor for determining the price of the diamond, and lower quality diamonds tend to be larger.

Build a Simple Regression Model for Diamond Price

We build a simple regression model to predict diamond price by carat.

Improving the model

Next, let's make a few tweaks to our model:

Remove outliers

Adding another variable 'cut'

How to interpret the coefficients?

  • carat: 7974.56625425
  • cut_Ideal: 1751.30245509
  • cut_Premium: 1379.87435057
  • cut_Very Good: 1452.57396734
  • cut_Good: 1063.00747392

Encode 'cut' with an ordinal encoder

With the knowledge that the five values of cut, ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], indicate the quality from low to high, we could also encode this variable using an ordinal encoder (0, 1, 2, 3, 4).

How to interpret the coefficients?

  • carat: 7943.41866224
  • cut: 256.31917809

Non-linearity

The relationship between carat and price seems to be non-linear.

Exercise

Build another regression model (e.g., including other variables such as color and clarity, or adding categorical variable(s) to the log-transformed model). Test its performance.

OneHot model

Encoded Color

Encoded Clarity

Python

In this course, we will use the Python programming language for all tutorials, exercises, and assignments.

Python is a great general-purpose programming language. With the help of several popular libraries (e.g., numpy, scipy, pandas, matplotlib, sklearn), it provides a powerful environment for data analytics and computing.

It'd be great if you have some experience with Python and numpy. If not, treat this notebook as a crash course on the basics of Python programming and its use for scientific computing.

Many say that Python code reads like pseudocode, since it can express powerful ideas in very few lines while remaining highly readable.

As an example, here is an implementation of the classic quicksort algorithm in Python:
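A typical version of that example:

```python
def quicksort(arr):
    """Recursive quicksort: partition around a pivot, sort each side."""
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

print(quicksort([3, 6, 8, 10, 1, 2, 1]))  # [1, 1, 2, 3, 6, 8, 10]
```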

Basic Data Types

Like most languages, Python has a number of basic types including integers, floats, booleans, and strings.

Numbers: Integers and floats work as you would expect from other languages:

Booleans: Python implements all of the usual operators for Boolean logic:

Strings: Python has great support for strings:

String objects have a bunch of useful methods:

How to extract a substring from a string? You can use the following templates:
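The main template is slicing, s[start:stop], where stop is exclusive, omitted bounds default to the ends of the string, and negative indices count from the end:

```python
s = 'hello world'

print(s[0:5])   # 'hello' -- characters 0 through 4
print(s[6:])    # 'world' -- from index 6 to the end
print(s[:5])    # 'hello' -- start defaults to 0
print(s[-5:])   # 'world' -- negative index counts from the end
```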

Count the occurrence of a character in a string.

For example, count the occurrence of spaces (' ') in the string.
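The simplest way is the built-in str.count() method; an explicit loop gives the same answer:

```python
s = 'the quick brown fox'

# Built-in method: counts non-overlapping occurrences of the substring.
print(s.count(' '))  # 3

# Equivalent manual count with a generator expression.
n_spaces = sum(1 for ch in s if ch == ' ')
print(n_spaces)  # 3
```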

Python List

See https://realpython.com/python-lists-tuples/

A Python list is a collection of arbitrary objects, similar to an array in many other programming languages.

List Comprehension

List comprehension formula:

new_list = [expression (if conditional for changing the value) for member in iterable (if conditional for filtering the value)]
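A few concrete instances of the formula:

```python
numbers = [1, 2, 3, 4, 5, 6]

# Expression only: transform every member.
squares = [n ** 2 for n in numbers]

# Trailing 'if': filter members before the expression sees them.
evens = [n for n in numbers if n % 2 == 0]

# 'if/else' inside the expression: change the value that gets produced.
labels = ['even' if n % 2 == 0 else 'odd' for n in numbers]

print(squares)  # [1, 4, 9, 16, 25, 36]
print(evens)    # [2, 4, 6]
print(labels)   # ['odd', 'even', 'odd', 'even', 'odd', 'even']
```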

Python Dictionary

A dictionary stores (key, value) pairs.

It is easy to iterate over the keys in a dictionary:
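For example:

```python
d = {'cat': 'cute', 'dog': 'furry', 'fish': 'wet'}

# Iterating over a dictionary yields its keys (in insertion order).
for animal in d:
    print(f'A {animal} is {d[animal]}')

# items() yields (key, value) pairs directly, avoiding the extra lookup.
for animal, trait in d.items():
    print(f'A {animal} is {trait}')
```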

Python Set

Unlike a list, a set is an unordered collection of distinct elements.

Python Tuple

A tuple is an (immutable) ordered list of values. A tuple is in many ways similar to a list; one of the most important differences is that tuples can be used as keys in dictionaries and as elements of sets, while lists cannot.

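For example:

```python
# Tuples are immutable and hashable, so they work as dictionary keys.
d = {(x, x + 1): x for x in range(5)}  # key (1, 2) -> value 1, etc.
print(d[(1, 2)])  # 1
print(d[(4, 5)])  # 4

# Lists are mutable, so trying to use one as a key raises TypeError.
try:
    d[[1, 2]] = 'oops'
except TypeError as err:
    print('not allowed:', err)
```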

Numpy Basics

Reference chapter: https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html

Creating Arrays from Python Lists

All elements in a NumPy array must be of the same type. If types do not match, NumPy will upcast if possible: integers are up-cast to floating point:
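For example:

```python
import numpy as np

# Mixed int/float input: everything is upcast to a common float dtype.
a = np.array([1, 2, 3.14])
print(a.dtype)  # float64

# All-integer input keeps an integer dtype (exact width is platform dependent).
b = np.array([1, 2, 3])
print(b.dtype)
```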

Creating Arrays from Scratch

Numpy Array Indexing and Slicing

To access a slice of an array x, use this:

x[start:stop:step]

The slice extends from the ‘start’ index and ends one item before the ‘stop’ index.

If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.
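For example:

```python
import numpy as np

x = np.arange(10)   # [0 1 2 3 4 5 6 7 8 9]

print(x[2:7])    # [2 3 4 5 6] -- start=2, stop=7 (exclusive)
print(x[:5])     # [0 1 2 3 4] -- start defaults to 0
print(x[5:])     # [5 6 7 8 9] -- stop defaults to the dimension size
print(x[::2])    # [0 2 4 6 8] -- every other element
print(x[::-1])   # [9 8 7 6 5 4 3 2 1 0] -- a negative step reverses
```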

NumPy slices are views of the array, not copies!

This default behavior is very useful: when handling large datasets, it lets us access and process part of the dataset without copying the underlying data (which could be slow and memory-intensive).

If you want to make a copy of the slice, you have to use the copy() method:
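The following contrasts a view with a copy:

```python
import numpy as np

x = np.arange(10)

view = x[2:5]        # a view: shares memory with x
view[0] = 99
print(x[2])          # 99 -- writing through the view changed x

dup = x[2:5].copy()  # an independent copy
dup[0] = -1
print(x[2])          # still 99 -- x is unaffected
```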

Other Useful Methods

axis

The axis argument names the dimension that gets collapsed: axis=0 collapses the rows (one result per column), while axis=1 collapses the columns (one result per row). If axis is omitted, the aggregation runs over all elements.
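For example, with a 2x3 array:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.sum())        # 21 -- all elements
print(m.sum(axis=0))  # [5 7 9] -- collapse the rows: one sum per column
print(m.sum(axis=1))  # [ 6 15] -- collapse the columns: one sum per row
```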

Computation on Arrays

NumPy functions

These arithmetic operations are convenient wrappers around specific functions built into NumPy. The following table lists the arithmetic operators implemented in NumPy:
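For example, the + and * operators call np.add and np.multiply under the hood:

```python
import numpy as np

x = np.arange(4)     # [0 1 2 3]

print(x + 2)              # [2 3 4 5]
print(np.add(x, 2))       # [2 3 4 5] -- the ufunc the operator wraps
print(x * 3)              # [0 3 6 9]
print(np.multiply(x, 3))  # [0 3 6 9]
```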

Aggregation functions

Aggregates available in NumPy can be extremely useful for summarizing a set of values.

As a simple example, let's consider the heights (cm) of US presidents.
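The referenced handbook chapter loads those heights from a CSV; since that file is not available here, the sketch below uses a few illustrative values to show the aggregation functions:

```python
import numpy as np

# Illustrative heights in cm (not the full presidents dataset).
heights = np.array([189, 170, 189, 163, 183, 171, 185, 168, 173, 183])

print("mean:  ", heights.mean())
print("std:   ", heights.std())
print("min:   ", heights.min())
print("max:   ", heights.max())
print("median:", np.median(heights))
print("25th percentile:", np.percentile(heights, 25))
```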

Assignment: Python Basics

We assume you are using Python 3 in this course.

Question 1

You've learned different data types in Python. Now, let's test your knowledge.

You need to use f-strings to format strings.

Write a program using the "f-strings" (https://realpython.com/python-f-strings/) and user input function to convert temperatures from Fahrenheit to Celsius. [Formula: Celsius = (Fahrenheit – 32)*5/9]

Hint: you may need the int() or float() function.

An example program output:

Please enter the temperature in Fahrenheit: 140
140F is 60.0 in Celsius.

Now, do the opposite by converting temperatures from Celsius to Fahrenheit. Another example:

Please enter the temperature in Celsius: 60
60C is 140.0 in Fahrenheit.

Next, let's make it a bit more challenging by adding conditional statements to convert temperatures both to and from Celsius and Fahrenheit. [Formula: Celsius/5 = (Fahrenheit – 32)/9]

An example program output:

Please enter the temperature: 60
Is this in Celsius or Fahrenheit? C
60C is 140 in Fahrenheit

Another example:

Please enter the temperature: 140
Is this in Celsius or Fahrenheit? F
140F is 60 in Celsius

Question 2

Have some fun with strings.

Question 3

Review list comprehension if needed: https://realpython.com/list-comprehension-python/

The formula for list comprehension is: new_list = [expression for member in iterable (if conditional)]

You need to do the following:

Question 4

This question is about Python dictionaries.

Vaccine Efficacy
Pfizer 95%
Moderna 95%
AstraZeneca 72%
Johnson & Johnson 66%

Source: https://www.biospace.com/article/comparing-covid-19-vaccines-pfizer-biontech-moderna-astrazeneca-oxford-j-and-j-russia-s-sputnik-v/

Question 5

Next, we will practice with Numpy arrays.